hpr3328 :: Pandas Part 2

Enigma continues his discussion of his favorite Python module, Pandas.

Hosted by Enigma on 2021-05-05. Flagged as Clean and released under a CC-BY-SA license.
Tags: python, pandas, Data, Data Science. Comments: 2.
The show is available on the Internet Archive at: https://archive.org/details/hpr3328


Duration: 00:11:59

Part of the series: A Little Bit of Python.

Initially based on the podcast "A Little Bit of Python", by Michael Foord, Andrew Kuchling, Steve Holden, Dr. Brett Cannon and Jesse Noller. https://www.voidspace.org.uk/python/weblog/arch_d7_2009_12_19.shtml#e1138

Now the series is open to all.

Part two in the For the Love of Data series. Enigma covers part 2 of Pandas.
The following topics are discussed:

1) Another way to apply a condition to a field
2) Creating a DataFrame from a dictionary
3) Appending one DataFrame to another
4) Joining DataFrames with merge and join
5) Writing an output to csv
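
As a rough sketch of topics 2 through 5, the operations might look like the following. The column names and values are invented for illustration and are not taken from the episode's sample code. Note that `DataFrame.append` was removed in pandas 2.0, so this sketch uses `pd.concat` for the appending step, and it writes the CSV to an in-memory buffer rather than a file.

```python
import io

import pandas as pd

# Hypothetical data; not from the episode's sample code.
# Topic 2: build DataFrames from dictionaries (keys become column names).
players = pd.DataFrame({"Name": ["Ann", "Bob"], "Score": [9.1, 6.4]})
more_players = pd.DataFrame({"Name": ["Cy"], "Score": [4.8]})

# Topic 3: append the rows of one DataFrame to another.
# pd.concat stands in for the removed DataFrame.append.
all_players = pd.concat([players, more_players], ignore_index=True)

# Topic 4a: merge joins on a shared column.
teams = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Team": ["Red", "Blue", "Red"]})
merged = all_players.merge(teams, on="Name", how="left")

# Topic 4b: join aligns on the index, so set the key as the index first.
joined = all_players.set_index("Name").join(teams.set_index("Name"))

# Topic 5: write the result as CSV. A StringIO buffer stands in for a file
# path; passing a filename such as "output.csv" works the same way.
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Passing `index=False` to `to_csv` keeps the automatically generated row numbers out of the file, which is usually what you want when the index carries no meaning.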

Part 2 Sample code
Follow me on twitter @Ed_N1gma

Come chat on irc.freenode.net #hackerexchange



Comments


Comment #1 posted on 2021-05-05 19:49:39 by b-yeezi

Another great show

Thanks for another great show. I look forward to your next one.

As to your use of `df.apply` in lieu of `np.select`, here's my 2 cents:

`apply` is more readable in most cases, but `np.select` is faster. When performance matters, or when the dataset is very large, you might want to use `np.select`. For instance, with `np.select` on your example here, the operation ran roughly 8x faster on my PC.

```
%timeit df.apply(Scorelevel, axis=1)

448 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

```
%timeit np.select(cond_list, choice_list, default='Require Activation')

55.6 µs ± 440 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

In many cases, the readability can trump the need for speed, but just wanted to give a counter-point.
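
For readers who want to try the comparison, here is a minimal sketch of the two approaches side by side. The function `score_level`, the labels, and the sample scores are hypothetical stand-ins, since the episode's `Scorelevel`, `cond_list`, and `choice_list` are not shown on this page. Only the default `'Require Activation'` comes from the timing command above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Score": [9.5, 8.2, 5.0, 3.1]})  # hypothetical scores


def score_level(row):
    """Row-wise version: easy to read, but runs Python code for every row."""
    if row["Score"] >= 9:
        return "Top"
    if row["Score"] >= 8:
        return "High"
    if row["Score"] >= 4:
        return "Mid"
    return "Require Activation"


via_apply = df.apply(score_level, axis=1)

# Vectorized version: one boolean mask per label, evaluated on whole columns.
cond_list = [
    df["Score"] >= 9,
    (df["Score"] >= 8) & (df["Score"] < 9),
    (df["Score"] >= 4) & (df["Score"] < 8),
]
choice_list = ["Top", "High", "Mid"]  # hypothetical labels
via_select = np.select(cond_list, choice_list, default="Require Activation")
```

Both produce the same labels. `np.select` takes the first matching condition for each element, so the order of `cond_list` matters when the conditions overlap.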

Comment #2 posted on 2021-05-05 19:58:07 by b-yeezi

One more speed gain

If you really want to fly, you can turn the pandas Series into NumPy arrays first. For your example, it ran about 2x faster than the regular `np.select`.

Example:

```
cond_list = [
    df['Score'].values >= 9,
    ((df['Score'].values >= 8) & (df['Score'].values < 9)),
    ((df['Score'].values >= 7) & (df['Score'].values < 8)),
    ((df['Score'].values >= 6) & (df['Score'].values < 7)),
    ((df['Score'].values >= 5) & (df['Score'].values < 6)),
    ((df['Score'].values >= 4) & (df['Score'].values < 5)),
]

%timeit np.select(cond_list, choice_list, default='Require Activation')

23.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
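
A slightly tidier variant of the same idea, offered only as a sketch: extract the array once into a variable and build the conditions from it. It uses `Series.to_numpy()`, which the pandas documentation recommends over `.values`. The data and labels here are hypothetical, and no timing is claimed for this version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Score": [9.5, 6.0, 2.0]})  # hypothetical scores

# Convert the column to a plain NumPy array once, then reuse it.
scores = df["Score"].to_numpy()

cond_list = [
    scores >= 9,
    (scores >= 6) & (scores < 7),
]
choice_list = ["Top", "Mid"]  # hypothetical labels
levels = np.select(cond_list, choice_list, default="Require Activation")
```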


Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes every weekday Monday through Friday.
This page was generated by The HPR Robot at Fri, 14 Jun 2024 13:09:08 +0000